Add clean markdown generation for LLM-friendly page content #723

Open
nearestnabors wants to merge 3 commits into main from feat/clean-markdown-for-llms

nearestnabors commented Feb 4, 2026

Currently, our public markdown files are littered with janky JSX components that LLMs can't read, and they link to HTML files instead of other markdown files. This PR FIXES that.

Summary

  • Generate clean markdown from rendered HTML pages during build time
  • Update /api/markdown endpoint to serve pre-generated clean markdown (falls back to real-time conversion)
  • Add CopyPageOverride component that intercepts the "Copy page" button to fetch clean markdown from the API
  • Add frontmatter (title, description) extracted from HTML <meta> tags to generated markdown files
  • Add public/_markdown/ to .gitignore
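
As a rough illustration of the frontmatter step, here is a minimal sketch. The helper names (`extractMeta`, `withFrontmatter`) are hypothetical and not taken from this PR, and it assumes the title comes from `<title>` and the description from `<meta name="description">`; the actual script may read different tags or use a real HTML parser.

```typescript
// Hypothetical helpers for illustration; not the PR's actual implementation.
interface PageMeta {
  title: string | null;
  description: string | null;
}

// Top-level patterns. Regexes are enough for a sketch; an HTML parser is more robust.
const TITLE_PATTERN = /<title[^>]*>([^<]*)<\/title>/i;
const DESCRIPTION_PATTERN = /<meta\s+name=["']description["']\s+content=(["'])(.*?)\1/i;

// Pull the title and description out of rendered HTML (null when missing or empty).
function extractMeta(html: string): PageMeta {
  const title = TITLE_PATTERN.exec(html)?.[1]?.trim() || null;
  const description = DESCRIPTION_PATTERN.exec(html)?.[2]?.trim() || null;
  return { title, description };
}

// Prepend a frontmatter block; JSON.stringify yields a double-quoted scalar
// with quotes and backslashes escaped.
function withFrontmatter(markdown: string, meta: PageMeta): string {
  const lines: string[] = [];
  if (meta.title) lines.push(`title: ${JSON.stringify(meta.title)}`);
  if (meta.description) lines.push(`description: ${JSON.stringify(meta.description)}`);
  if (lines.length === 0) return markdown; // nothing to add
  return `---\n${lines.join("\n")}\n---\n\n${markdown}`;
}
```

Pages without either tag pass through unchanged, so the frontmatter block is only emitted when there is something to put in it.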

Changes

  • scripts/generate-clean-markdown.ts: New script that runs a production server, fetches rendered HTML, and converts to clean markdown using Turndown
  • app/api/markdown/[[...slug]]/route.ts: Updated to serve pre-generated markdown first, with fallback
  • app/_components/copy-page-override.tsx: Client component that intercepts copy button clicks
  • app/_components/custom-layout.tsx: Includes the CopyPageOverride component
  • scripts/generate-llmstxt.ts: Uses pre-generated clean markdown if available
  • package.json: Added build scripts for generating clean markdown
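
The "pre-generated first, with fallback" order in the route could look roughly like the sketch below. Everything here is an assumption made for illustration: `resolveMarkdown`, the injected `readPregenerated` and `convertRealtime` callbacks, and the slug-to-path mapping are hypothetical, and the real route handler is presumably asynchronous and reads from `public/_markdown/`.

```typescript
// Hypothetical names; dependencies are injected so the lookup order is easy to test.
type MarkdownSource = "pregenerated" | "realtime";

interface MarkdownResult {
  source: MarkdownSource;
  body: string;
}

function resolveMarkdown(
  slug: string[],
  readPregenerated: (relativePath: string) => string | null,
  convertRealtime: (pagePath: string) => string,
): MarkdownResult {
  // Assumed mapping: /api/markdown/en/home.md -> slug ["en", "home.md"] -> "en/home.md"
  const relativePath = slug.join("/");

  // 1. Prefer the file produced at build time.
  const pregenerated = readPregenerated(relativePath);
  if (pregenerated !== null) {
    return { source: "pregenerated", body: pregenerated };
  }

  // 2. Otherwise convert the page on demand.
  const pagePath = "/" + relativePath.replace(/\.md$/, "");
  return { source: "realtime", body: convertRealtime(pagePath) };
}
```

Returning the `source` alongside the body is a design choice for this sketch only; it makes it easy to log or assert which path served a request.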

Test plan

  • Run pnpm build to verify the build succeeds
  • Run pnpm generate:clean-markdown to verify markdown generation works
  • Visit any page and click "Copy page" to verify it copies clean markdown
  • Check that /api/markdown/en/home.md returns markdown with frontmatter

🤖 Generated with Claude Code

- Generate clean markdown from rendered HTML pages during build
- Update /api/markdown endpoint to serve pre-generated clean markdown
- Add CopyPageOverride component to fetch clean markdown on "Copy page"
- Add frontmatter (title, description) extracted from HTML meta tags
- Fix linting issues with top-level regex and simplified logic

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>

vercel bot commented Feb 4, 2026

The latest updates on your projects. Learn more about Vercel for GitHub.

Project | Deployment | Actions          | Updated (UTC)
docs    | Ready      | Preview, Comment | Feb 4, 2026 11:24pm

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>

evantahler left a comment

Looks like the test has been running for a few hours... there might be something preventing the process from exiting (a dangling promise?)

I'd also love to see a test for one of the clean markdown files that initially had some HTML, confirming that the HTML was removed successfully
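
A check along the lines requested here might be sketched as follows. The tag list mirrors the one named in the follow-up commit (script, style, svg, nav, footer, aside); the function name `findLeftoverHtml` is hypothetical, and the sketch does not skip fenced code blocks, so a page that legitimately shows `<script>` in an example would be flagged.

```typescript
// Elements that should not survive conversion to clean markdown.
const FORBIDDEN_TAGS = ["script", "style", "svg", "nav", "footer", "aside"];

// Precompiled, non-global patterns: "<tag" followed by whitespace, ">" or "/".
// The lookahead keeps "<navigation>" from matching "nav".
const FORBIDDEN_PATTERNS = FORBIDDEN_TAGS.map((tag) => ({
  tag,
  pattern: new RegExp(`<${tag}(?=[\\s>/])`, "i"),
}));

// Return the forbidden tag names still present as raw HTML in the markdown.
function findLeftoverHtml(markdown: string): string[] {
  return FORBIDDEN_PATTERNS
    .filter(({ pattern }) => pattern.test(markdown))
    .map(({ tag }) => tag);
}
```

A test could then assert that `findLeftoverHtml` returns an empty array for a generated file whose source page is known to contain these elements.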

- Add explicit process.exit(0) to ensure script terminates after completion
  (event listeners on spawned server kept event loop alive)
- Add validation test for HTML element removal (script, style, svg, nav, footer, aside)
- Refactor validateGeneratedContent into smaller helper functions to reduce complexity
- Increase MIN_INTEGRATION_LINKS threshold from 5 to 10

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
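
The exit fix described in this commit can be pictured with the sketch below. It is an illustration under assumptions, not the script's actual code: `runThenExit` and `ServerHandle` are hypothetical, stopping the server in `finally` and exiting with code 1 on failure are choices made for the sketch, and the commit itself only states that an explicit `process.exit(0)` was added.

```typescript
// Hypothetical shape of a spawned server; a real script would wrap a child process.
interface ServerHandle {
  kill(): void;
}

// Run the generation step against a temporary server, then exit explicitly.
// Listeners attached to a spawned process can keep the event loop alive,
// so the script does not rely on a natural exit.
async function runThenExit(
  startServer: () => ServerHandle,
  generate: () => Promise<void>,
  exit: (code: number) => void,
): Promise<void> {
  const server = startServer();
  let code = 0;
  try {
    await generate();
  } catch (error) {
    code = 1; // assumption: report failure through the exit code
  } finally {
    server.kill(); // stop the server whether or not generation succeeded
  }
  exit(code);
}
```

In a real script, `exit` would be `process.exit` and `startServer` would return the handle from `child_process.spawn`; injecting both keeps the ordering (generate, stop server, exit) testable without spawning anything.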
@teallarson

[Image: Screenshot 2026-02-05 at 4 36 05 PM]

I'm not sure I know what a janky JSX component is in this context, or whether any JSX is acceptable.

Is this what the homepage's markdown should show?

3 participants